Recent work on the structure of social networks and the internet has focussed attention on graphs 
with distributions of vertex degree that are significantly different from the Poisson degree distri- 
butions that have been widely studied in the past. In this paper we develop in detail the theory 
of random graphs with arbitrary degree distributions. In addition to simple undirected, unipartite 
graphs, we examine the properties of directed and bipartite graphs. Among other results, we derive 
exact expressions for the position of the phase transition at which a giant component first forms, the 
mean component size, the size of the giant component if there is one, the mean number of vertices 
a certain distance away from a randomly chosen vertex, and the average vertex-vertex distance 
within a graph. We apply our theory to some real-world graphs, including the world-wide web and 
collaboration graphs of scientists and Fortune 1000 company directors. We demonstrate that in 
some cases random graphs with appropriate distributions of vertex degree predict with surprising 
accuracy the behavior of the real world, while in others there is a measurable discrepancy between 
theory and reality, perhaps indicating the presence of additional social structure in the network that 
is not captured by the random graph. 



I. INTRODUCTION 

A random graph [Q| is a collection of points, or vertices, 
with lines, or edges, connecting pairs of them at random 
(Fig. 1a). The study of random graphs has a long history. 
Starting with the influential work of Paul Erdos 
and Alfred Renyi in the 1950s and 1960s (|-||], random 
graph theory has developed into one of the mainstays of 
modern discrete mathematics, and has produced a prodi- 
gious number of results, many of them highly ingenious, 
describing statistical properties of graphs, such as distri- 
butions of component sizes, existence and size of a giant 
component, and typical vertex-vertex distances. 

In almost all of these studies the assumption has been 
made that the presence or absence of an edge between 
two vertices is independent of the presence or absence of 
any other edge, so that each edge may be considered to 
be present with independent probability p. If there are N 
vertices in a graph, and each is connected to an average 
of z edges, then it is trivial to show that p = z/(N - 1), 
which for large N is usually approximated by z/N. The 
number of edges connected to any particular vertex is 
called the degree k of that vertex, and has a probability 
distribution $p_k$ given by 

\[ p_k = \binom{N}{k} p^k (1-p)^{N-k} \simeq \frac{z^k e^{-z}}{k!}, \tag{1} \]

where the second equality becomes exact in the limit of 
large N. This distribution we recognize as the Poisson 
distribution: the ordinary random graph has a Poisson 
distribution of vertex degrees, a point which turns out to 
be crucial, as we now explain. 
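
As a quick numerical illustration of the limit just described, the following sketch compares the binomial degree distribution of a graph with N vertices and edge probability p = z/N against its Poisson limit. The function names and the values N = 10000 and z = 3 are illustrative assumptions, not taken from the text.

```python
import math

def binomial_pk(k, n, p):
    """Probability of degree k when each of n possible edges exists with probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pk(k, z):
    """Poisson probability of degree k for mean degree z."""
    return z**k * math.exp(-z) / math.factorial(k)

n, z = 10_000, 3.0          # illustrative values (assumed)
p = z / n                   # large-N approximation p = z/N
# Largest pointwise difference between the two distributions for k = 0..19
max_gap = max(abs(binomial_pk(k, n, p) - poisson_pk(k, z)) for k in range(20))
```

For graphs of this size the two distributions agree closely, which is why the Poisson form is normally used in place of the binomial one.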

FIG. 1. (a) A schematic representation of a random graph, 
the circles representing vertices and the lines edges. (b) A 
directed random graph, i.e., one in which each edge runs in 
only one direction. 

Random graphs are not merely a mathematical toy; 
they have been employed extensively as models of real-world 
networks of various types, particularly in epidemiology. 
The passage of a disease through a community depends 
strongly on the pattern of contacts between those 
infected with the disease and those susceptible to it. This 
pattern can be depicted as a network, with individuals 
represented by vertices and contacts capable of transmit- 
ting the disease by edges. The large class of epidemio- 
logical models known as susceptible/infectious/recovered 
(or SIR) models 1^-0 makes frequent use of the so-called 
fully mixed approximation, which is the assumption that 
contacts are random and uncorrelated, i.e., that they 
form a random graph. 

Random graphs however turn out to have severe short- 
comings as models of such real-world phenomena. Al- 
though it is difficult to determine experimentally the 
structure of the network of contacts by which a disease 
is spread |^ , studies have been performed of other social 
networks such as networks of friendships within a variety 
of communities [9-11], networks of telephone calls [12,13], 
airline timetables, and the power grid, as well as 
networks in physical or biological systems, including neural 
networks [15], the structure and conformation space 
of polymers [16,17], metabolic pathways [18,19], and food 



webs 1^0. It is found 0,0 that the distribution 
of vertex degrees in many of these networks is measur- 
ably different from a Poisson distribution — often wildly 
different — and this strongly suggests, as has been em- 
phasized elsewhere |2^, that there are features of such 
networks which we would miss if we were to approximate 
them by an ordinary (Poisson) random graph. 

Another very widely studied network is the internet, 
whose structure has attracted an exceptional amount of 
scrutiny, academic and otherwise, following its meteoric 
rise to public visibility starting in 1993. Pages on the 
world-wide web may be thought of as the vertices of a 
graph and the hyperlinks between them as edges. Empir- 
ical studies [p3|-^] have shown that this graph has a dis- 
tribution of vertex degree which is heavily right-skewed 
and possesses a fat (power-law) tail with an exponent 
between -2 and -3. (The underlying physical structure 
of the internet also has a degree distribution of this 
type p^.) This distribution is very far from Poisson, and 
therefore we would expect that a simple random graph 
would give a very poor approximation of the structural 
properties of the web. However, the web differs from a 
random graph in another way also: it is directed. Links 
on the web lead from one page to another in only one 
direction (see Fig. 1b). As discussed by Broder et al. |2^] 
this has a significant practical effect on the typical acces- 
sibility of one page from another, and this effect also will 
not be captured by a simple (undirected) random graph 
model. 

A further class of networks that has attracted scrutiny 
is the class of collaboration networks. Examples of 
such networks include the boards of directors of compa- 
nies p^-|3l|, co-ownership networks of companies 
and collaborations of scientists [33-37] and movie actors. 
As well as having strongly non-Poisson degree 
distributions [36], these networks have a bipartite 
structure; there are two distinct kinds of vertices on the 
graph with links running only between vertices of unlike 
kinds [38] (see Fig. 2). In the case of movie actors, for 
example, the two types of vertices are movies and actors, 
and the network can be represented as a graph with edges 
running between each movie and the actors that appear 
in it. Researchers have also considered the projection of 
this graph onto the unipartite space of actors only, also 
called a one-mode network [Q. In such a projection two 
actors are considered connected if they have appeared 
in a movie together. The construction of the one-mode 
network however involves discarding some of the infor- 
mation contained in the original bipartite network, and 
for this reason it is more desirable to model collaboration 
networks using the full bipartite structure. 

FIG. 2. A schematic representation (top) of a bipartite 
graph, such as the graph of movies and the actors who have 
appeared in them. In this small graph we have four movies, 
labeled 1 to 4, and eleven actors, labeled A to K, with edges 
joining each movie to the actors in its cast. In the lower part 
of the picture we show the one-mode projection of the graph 
for the eleven actors. 

Given the high current level of interest in the structure 
of many of the graphs described here, and given their substantial 
differences from the ordinary random graphs that 
have been studied in the past, it would clearly be useful 
if we could generalize the mathematics of random graphs 
to non-Poisson degree distributions, and to directed and 
bipartite graphs. In this paper we do just that, demon- 
strating in detail how the statistical properties of each of 
these graph types can be calculated exactly in the limit 
of large graph size. We also give examples of the ap- 
plication of our theory to the modeling of a number of 
real-world networks, including the world-wide web and 
collaboration graphs. 



II. RANDOM GRAPHS WITH ARBITRARY 
DEGREE DISTRIBUTIONS 

In this section we develop a formalism for calculating 
a variety of quantities, both local and global, on large 
unipartite undirected graphs with arbitrary probability 
distribution of the degrees of their vertices. In all re- 
spects other than their degree distribution, these graphs 
are assumed to be entirely random. This means that 
the degrees of all vertices are independent identically- 
distributed random integers drawn from a specified dis- 
tribution. For a given choice of these degrees, also called 
the "degree sequence," the graph is chosen uniformly at 
random from the set of all graphs with that degree se- 
quence. All properties calculated in this paper are aver- 
aged over the ensemble of graphs generated in this way. 
In the limit of large graph size an equivalent procedure is 
to study only one particular degree sequence, averaging 
uniformly over all graphs with that sequence, where the 
sequence is chosen to approximate as closely as possible 
the desired probability distribution. The latter proce- 
dure can be thought of as a "microcanonical ensemble" 
for random graphs, where the former is a "canonical en- 
semble." 

Some results are already known for random graphs 
with arbitrary degree distributions: in two beautiful recent 
papers Molloy and Reed have derived formulas 
for the position of the phase transition at which 
a giant component first appears, and the size of the giant 
component. (These results are calculated within the 
microcanonical ensemble, but apply equally to the canonical 
one in the large system size limit.) The formalism we 
present in this paper yields an alternative derivation of 
these results and also provides a framework for obtaining 
other quantities of interest, some of which we calculate. 
In Sections III and IV we extend our formalism to the 
case of directed graphs (such as the world-wide web) and 
bipartite graphs (such as collaboration graphs). 



A. Generating functions 

Our approach is based on generating functions p[ | , the 
most fundamental of which, for our purposes, is the generating 
function $G_0(x)$ for the probability distribution of 
vertex degrees k. Suppose that we have a unipartite undirected 
graph (an acquaintance network, for example) of 
N vertices, with N large. We define 

\[ G_0(x) = \sum_{k=0}^{\infty} p_k x^k, \tag{2} \]

where $p_k$ is the probability that a randomly chosen vertex 
on the graph has degree k. The distribution $p_k$ is 
assumed correctly normalized, so that 

\[ G_0(1) = 1. \tag{3} \]

The same will be true of all generating functions considered 
here, with a few important exceptions, which we will 
note at the appropriate point. Because the probability 
distribution is normalized and positive definite, $G_0(x)$ is 
also absolutely convergent for all $|x| \le 1$, and hence has 
no singularities in this region. All the calculations of this 
paper will be confined to the region $|x| \le 1$. 

The function $G_0(x)$, and indeed any probability generating 
function, has a number of properties that will prove 
useful in subsequent developments. 

Derivatives. The probability $p_k$ is given by the kth 
derivative of $G_0$ according to 

\[ p_k = \frac{1}{k!} \left. \frac{d^k G_0}{dx^k} \right|_{x=0}. \tag{4} \]

Thus the one function $G_0(x)$ encapsulates all the information 
contained in the discrete probability distribution 
$p_k$. We say that the function $G_0(x)$ "generates" the 
probability distribution $p_k$. 



Moments. The average over the probability distribution 
generated by a generating function (for instance, 
the average degree z of a vertex in the case of $G_0(x)$) is 
given by 

\[ z = \langle k \rangle = \sum_k k p_k = G_0'(1). \tag{5} \]

Thus if we can calculate a generating function we can also 
calculate the mean of the probability distribution which 
it generates. Higher moments of the distribution can be 
calculated from higher derivatives also. In general, we 
have 

\[ \langle k^n \rangle = \sum_k k^n p_k = \left[ \left( x \frac{d}{dx} \right)^{n} G_0(x) \right]_{x=1}. \tag{6} \]

Powers. If the distribution of a property k of an object 
is generated by a given generating function, then the distribution 
of the total of k summed over m independent 
realizations of the object is generated by the mth power 
of that generating function. For example, if we choose 
m vertices at random from a large graph, then the distribution 
of the sum of the degrees of those vertices is 
generated by $[G_0(x)]^m$. To see why this is so, consider 
the simple case of just two vertices. The square $[G_0(x)]^2$ 
of the generating function for a single vertex can be expanded as 

\[ \begin{aligned} [G_0(x)]^2 &= \Big[ \sum_k p_k x^k \Big]^2 = \sum_{jk} p_j p_k x^{j+k} \\ &= p_0 p_0 x^0 + (p_0 p_1 + p_1 p_0) x^1 + (p_0 p_2 + p_1 p_1 + p_2 p_0) x^2 \\ &\quad + (p_0 p_3 + p_1 p_2 + p_2 p_1 + p_3 p_0) x^3 + \ldots \end{aligned} \tag{7} \]

It is clear that the coefficient of the power $x^n$ in this 
expression is precisely the sum of all products $p_j p_k$ such 
that j + k = n, and hence correctly gives the probability 
that the sum of the degrees of the two vertices will be n. 
It is straightforward to convince oneself that this property 
extends also to all higher powers of the generating 
function. 

All of these properties will be used in the derivations 
given in this paper. 
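
The three properties above can be illustrated with a short sketch in which a generating function is stored as its list of coefficients, so that entry k is the probability of degree k. The helper names and the three-point degree distribution are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def moment(p, n=1):
    """n-th moment sum_k k^n p_k of the distribution with coefficients p."""
    return sum(k**n * pk for k, pk in enumerate(p))

def power(p, m):
    """Coefficients of the m-th power of the generating function with coefficients p."""
    out = [1.0]
    for _ in range(m):
        new = [0.0] * (len(out) + len(p) - 1)
        for i, a in enumerate(out):
            for j, b in enumerate(p):
                new[i + j] += a * b   # x^i * x^j contributes to x^(i+j)
        out = new
    return out

p = [0.2, 0.5, 0.3]      # hypothetical p_0, p_1, p_2
two = power(p, 2)        # distribution of the summed degree of two random vertices
```

Here `two[n]` collects all products of probabilities whose indices sum to n, and `moment(two)` is twice `moment(p)`, as expected for the sum of two independent degrees.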

Another quantity that will be important to us is the 
distribution of the degree of the vertices that we arrive 
at by following a randomly chosen edge. Such an edge 
arrives at a vertex with probability proportional to the 
degree of that vertex, and the vertex therefore has a probability 
distribution of degree proportional to $k p_k$. The 
correctly normalized distribution is generated by 

\[ \frac{\sum_k k p_k x^k}{\sum_k k p_k} = x \frac{G_0'(x)}{G_0'(1)}. \tag{8} \]

If we start at a randomly chosen vertex and follow 
each of the edges at that vertex to reach the k nearest 
neighbors, then the vertices arrived at each have the distribution 
of remaining outgoing edges generated by this 
function, less one power of x, to allow for the edge that 
we arrived along. Thus the distribution of outgoing edges 
is generated by the function 

\[ G_1(x) = \frac{G_0'(x)}{G_0'(1)} = \frac{1}{z} G_0'(x), \tag{9} \]

where z is the average vertex degree, as before. The probability 
that any of these outgoing edges connects to the 
original vertex that we started at, or to any of its other 
immediate neighbors, goes as $N^{-1}$ and hence can be neglected 
in the limit of large N. Thus, making use of the 
"powers" property of the generating function described 
above, the generating function for the probability distribution 
of the number of second neighbors of the original 
vertex can be written as 

\[ \sum_k p_k [G_1(x)]^k = G_0(G_1(x)). \tag{10} \]



Similarly, the distribution of third-nearest neighbors is 
generated by $G_0(G_1(G_1(x)))$, and so on. The average 
number $z_2$ of second neighbors is 

\[ z_2 = \left[ \frac{d}{dx} G_0(G_1(x)) \right]_{x=1} = G_0'(1)\, G_1'(1) = G_0''(1), \tag{11} \]

where we have made use of the fact that $G_1(1) = 1$. (One 
might be tempted to conjecture that since the average 
number of first neighbors is $G_0'(1)$, Eq. (5), and the average 
number of second neighbors is $G_0''(1)$, Eq. (11), then 
the average number of mth neighbors should be given by 
the mth derivative of $G_0$ evaluated at x = 1. As we show 
in Section II F, however, this conjecture is wrong.) 
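
A minimal sketch of these two quantities for a finite degree distribution is given below. It assumes the distribution is supplied as a list of probabilities indexed by degree; the function names and the example values are hypothetical. The coefficients of the edge-following generating function are obtained by differentiating the vertex generating function term by term and dividing by the mean degree.

```python
def excess_degree(p):
    """Coefficients q_k = (k + 1) p_{k+1} / z of G1(x) = G0'(x) / z."""
    z = sum(k * pk for k, pk in enumerate(p))
    return [(k + 1) * p[k + 1] / z for k in range(len(p) - 1)]

def second_neighbors(p):
    """Average number of second neighbors, z2 = G0''(1) = sum_k k (k - 1) p_k."""
    return sum(k * (k - 1) * pk for k, pk in enumerate(p))

p = [0.1, 0.3, 0.4, 0.2]                       # hypothetical p_0 .. p_3
z = sum(k * pk for k, pk in enumerate(p))      # mean degree G0'(1)
q = excess_degree(p)
z2 = second_neighbors(p)
# The product G0'(1) * G1'(1) should reproduce G0''(1)
z2_via_product = z * sum(k * qk for k, qk in enumerate(q))
```

The two routes to the second-neighbor count agree, which is the content of the chain-rule step in the text.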



B. Examples 

To make things more concrete, we immediately intro- 
duce some examples of specific graphs to illustrate how 
these calculations are carried out. 

a. Poisson-distributed graphs. The simplest example 
of a graph of this type is one for which the distribution of 
degree is binomial, or Poisson in the large N limit. This 
distribution yields the standard random graph studied 
by many mathematicians and discussed in Section I. In 
this graph the probability p = z/N of the existence of an 
edge between any two vertices is the same for all vertices, 
and $G_0(x)$ is given by 

\[ G_0(x) = \sum_{k=0}^{N} \binom{N}{k} p^k (1-p)^{N-k} x^k = (1 - p + px)^N = e^{z(x-1)}, \tag{12} \]

where the last equality applies in the limit $N \to \infty$. It is 
then trivial to show that the average degree of a vertex is 
indeed $G_0'(1) = z$ and that the probability distribution of 
degree is given by $p_k = z^k e^{-z}/k!$, which is the ordinary 
Poisson distribution. Notice also that for this special case 
we have $G_1(x) = G_0(x)$, so that the distribution of outgoing 
edges at a vertex is the same, regardless of whether 
we arrived there by choosing a vertex at random, or by 
following a randomly chosen edge. This property, which 
is peculiar to the Poisson-distributed random graph, is 
the reason why the theory of random graphs of this type 
is especially simple. 
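
The equality of the two generating functions in the Poisson case can be checked in one line, taking the large-N form $e^{z(x-1)}$ of the vertex generating function and the definition of the edge-following generating function as its normalized derivative:

```latex
G_1(x) = \frac{G_0'(x)}{G_0'(1)}
       = \frac{z\,e^{z(x-1)}}{z}
       = e^{z(x-1)}
       = G_0(x).
```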

b. Exponentially distributed graphs. Perhaps the next 
simplest type of graph is one with an exponential distribution 
of vertex degrees 

\[ p_k = (1 - e^{-1/\kappa})\, e^{-k/\kappa}, \tag{13} \]

where $\kappa$ is a constant. The generating function for this 
distribution is 

\[ G_0(x) = (1 - e^{-1/\kappa}) \sum_{k=0}^{\infty} e^{-k/\kappa} x^k = \frac{1 - e^{-1/\kappa}}{1 - x e^{-1/\kappa}}, \tag{14} \]

and 

\[ G_1(x) = \left[ \frac{1 - e^{-1/\kappa}}{1 - x e^{-1/\kappa}} \right]^2. \tag{15} \]

An example of a graph with an exponential degree distribution 
is given in Section V A. 

c. Power-law distributed graphs. The recent interest 
in the properties of the world-wide web and of social networks 
leads us to investigate the properties of graphs with 
a power-law distribution of vertex degrees. Such graphs 
have been discussed previously by Barabási et al. [23] 
and by Aiello et al. [13]. In this paper, we will look at 
graphs with degree distribution given by 

\[ p_k = C k^{-\tau} e^{-k/\kappa} \quad \text{for } k \ge 1, \tag{16} \]

where C, $\tau$, and $\kappa$ are constants. The reason for including 
the exponential cutoff is two-fold: first many real-world 
graphs appear to show this cutoff ||l^,Q ; second it makes 
the distribution normalizable for all $\tau$, and not just $\tau > 2$. 

The constant C is fixed by the requirement of normalization, 
which gives $C = [\mathrm{Li}_\tau(e^{-1/\kappa})]^{-1}$ and hence 

\[ p_k = \frac{k^{-\tau} e^{-k/\kappa}}{\mathrm{Li}_\tau(e^{-1/\kappa})} \quad \text{for } k \ge 1, \tag{17} \]

where $\mathrm{Li}_n(x)$ is the nth polylogarithm of x, a function 
familiar to those who have worked with Feynman integrals. 

Substituting (17) into Eq. (2), we find that the generating 
function for graphs with this degree distribution 
is 

\[ G_0(x) = \frac{\mathrm{Li}_\tau(x e^{-1/\kappa})}{\mathrm{Li}_\tau(e^{-1/\kappa})}. \tag{18} \]

In the limit $\kappa \to \infty$ (the case considered in Refs. [13] 
and [23]) this simplifies to 

\[ G_0(x) = \frac{\mathrm{Li}_\tau(x)}{\zeta(\tau)}, \tag{19} \]

where $\zeta(x)$ is the Riemann $\zeta$-function. 
The function $G_1(x)$ is given by 

\[ G_1(x) = \frac{\mathrm{Li}_{\tau-1}(x e^{-1/\kappa})}{x\, \mathrm{Li}_{\tau-1}(e^{-1/\kappa})}. \tag{20} \]

Thus, for example, the average number of neighbors of a 
randomly-chosen vertex is 

\[ z = \frac{\mathrm{Li}_{\tau-1}(e^{-1/\kappa})}{\mathrm{Li}_\tau(e^{-1/\kappa})}, \tag{21} \]

and the average number of second neighbors is 

\[ z_2 = \frac{\mathrm{Li}_{\tau-2}(e^{-1/\kappa}) - \mathrm{Li}_{\tau-1}(e^{-1/\kappa})}{\mathrm{Li}_\tau(e^{-1/\kappa})}. \tag{22} \]


d. Graphs with arbitrary specified degree distribution. 
In some cases we wish to model specific real-world graphs 
which have known degree distributions, known because 
we can measure them directly. A number of the graphs 
described in the introduction fall into this category. For 
these graphs, we know the exact numbers $n_k$ of vertices 
having degree k, and hence we can write down the exact 
generating function for that probability distribution in 
the form of a finite polynomial: 

\[ G_0(x) = \frac{\sum_k n_k x^k}{\sum_k n_k}, \tag{23} \]

where the sum in the denominator ensures that the generating 
function is properly normalized. As an example, 
suppose that in a community of 1000 people, each person 
knows between zero and five of the others, the exact numbers 
of people in each category being, from zero to five: 
{86, 150, 363, 238, 109, 54}. This distribution will then be 
generated by the polynomial 

\[ G_0(x) = \frac{86 + 150x + 363x^2 + 238x^3 + 109x^4 + 54x^5}{1000}. \tag{24} \]
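
The community example can be worked through numerically. The sketch below builds the normalized polynomial from the six counts given in the text and evaluates the average numbers of first and second neighbors; the variable and function names are assumptions made for illustration.

```python
counts = [86, 150, 363, 238, 109, 54]    # n_k for k = 0..5, from the example above
total = sum(counts)
p = [n / total for n in counts]          # normalized probabilities p_k

def g0(x):
    """Evaluate the finite-polynomial generating function at x."""
    return sum(pk * x**k for k, pk in enumerate(p))

z1 = sum(k * pk for k, pk in enumerate(p))             # first derivative at x = 1
z2 = sum(k * (k - 1) * pk for k, pk in enumerate(p))   # second derivative at x = 1
```

For these counts the mean degree is 2.296 and the mean number of second neighbors is 4.542.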



C. Component sizes 

FIG. 3. Schematic representation of the sum rule for the 
connected component of vertices reached by following a randomly 
chosen edge. The probability of each such component 
(left-hand side) can be represented as the sum of the probabilities 
(right-hand side) of having only a single vertex, having a 
single vertex connected to one other component, or two other 
components, and so forth. The entire sum can be expressed 
in closed form as Eq. (26). 

We are now in a position to calculate some properties 
of interest for our graphs. First let us consider the 
distribution of the sizes of connected components in the 
graph. Let $H_1(x)$ be the generating function for the distribution 
of the sizes of components which are reached 
by choosing a random edge and following it to one of its 
ends. We explicitly exclude from $H_1(x)$ the giant component, 
if there is one; the giant component is dealt with 
separately below. Thus, except when we are precisely 
at the phase transition where the giant component appears, 
typical component sizes are finite, and the chances 
of a component containing a closed loop of edges go as 
$N^{-1}$, which is negligible in the limit of large N. This 
means that the distribution of components generated by 
$H_1(x)$ can be represented graphically as in Fig. 3; each 
component is tree-like in structure, consisting of the single 
site we reach by following our initial edge, plus any 
number (including zero) of other tree-like clusters, with 
the same size distribution, joined to it by single edges. If 
we denote by $q_k$ the probability that the initial site has 
k edges coming out of it other than the edge we came in 
along, then, making use of the "powers" property of Section II A, 
$H_1(x)$ must satisfy a self-consistency condition 
of the form 

\[ H_1(x) = x q_0 + x q_1 H_1(x) + x q_2 [H_1(x)]^2 + \ldots \tag{25} \]

However, $q_k$ is nothing other than the coefficient of $x^k$ 
in the generating function $G_1(x)$, Eq. (9), and hence 
Eq. (25) can also be written 

\[ H_1(x) = x\, G_1(H_1(x)). \tag{26} \]

If we start at a randomly chosen vertex, then we have 
one such component at the end of each edge leaving that 
vertex, and hence the generating function for the size of 
the whole component is 

\[ H_0(x) = x\, G_0(H_1(x)). \tag{27} \]



In principle, therefore, given the functions $G_0(x)$ and 
$G_1(x)$, we can solve Eq. (26) for $H_1(x)$ and substitute 
into Eq. (27) to get $H_0(x)$. Then we can find the probability 
that a randomly chosen vertex belongs to a component 
of size s by taking the sth derivative of $H_0$. In practice, 
unfortunately, this is usually impossible; Eq. (26) 
is a complicated and frequently transcendental equation, 
which rarely has a known solution. On the other hand, 
we note that the coefficient of $x^s$ in the Taylor expansion 
of $H_1(x)$ (and therefore also the sth derivative) is given 
exactly by only s + 1 iterations of Eq. (26), starting with 
$H_1 = 1$, so that the distribution generated by $H_0(x)$ can 
be calculated exactly to finite order in finite time. With 
current symbolic manipulation programs, it is quite possible 
to evaluate the first one hundred or so derivatives 
in this way. Failing this, an approximate solution can be 
found by numerical iteration and the distribution of cluster 
sizes calculated from Eq. (27) by numerical differentiation. 
Since direct evaluation of numerical derivatives 
is prone to machine-precision problems, we recommend 
evaluating the derivatives by numerical integration of the 
Cauchy formula, giving the probability distribution $P_s$ of 
cluster sizes thus: 

\[ P_s = \frac{1}{s!} \left. \frac{d^s H_0}{dz^s} \right|_{z=0} = \frac{1}{2\pi i} \oint \frac{H_0(z)}{z^{s+1}} \, dz. \tag{28} \]

The best numerical precision is obtained by using the 
largest possible contour, subject to the condition that it 
enclose no poles of the generating function. The largest 
contour for which this condition is satisfied in general is 
the unit circle |z| = 1 (see Section II A), and we recommend 
using this contour for Eq. (28). It is possible to 
find the first thousand derivatives of a function without 
difficulty using this method E^. 
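
The finite-order iteration described above can be sketched with truncated polynomial arithmetic in place of a symbolic manipulation program. This is one possible illustration rather than the procedure used by the authors: the helper names are hypothetical, the degree distribution is assumed to be a finite list of probabilities, and the Cauchy-integral route is not shown.

```python
def poly_mul(a, b, order):
    """Product of two coefficient lists, truncated after x^order."""
    out = [0.0] * (order + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j > order:
                break
            out[i + j] += ai * bj
    return out

def compose(g, h, order):
    """Coefficients of g(h(x)) up to x^order, using Horner's rule."""
    out = [0.0] * (order + 1)
    for c in reversed(g):
        out = poly_mul(out, h, order)
        out[0] += c
    return out

def component_sizes(p, order):
    """Return P_s for s = 0..order, where P_s is the coefficient of x^s in H0(x)."""
    z = sum(k * pk for k, pk in enumerate(p))
    q = [(k + 1) * p[k + 1] / z for k in range(len(p) - 1)]   # coefficients of G1
    h1 = [1.0] + [0.0] * order                 # start the iteration from H1 = 1
    for _ in range(order + 1):                 # each pass fixes one more coefficient
        h1 = [0.0] + compose(q, h1, order)[:order]    # H1 <- x * G1(H1)
    return [0.0] + compose(p, h1, order)[:order]      # H0 = x * G0(H1)

sizes = component_sizes([0.25, 0.5, 0.25], order=10)   # hypothetical subcritical graph
```

In this example every vertex has degree at most two, so components are short chains, and the first few entries of `sizes` can be checked by hand against the self-consistency condition.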



D. The mean component size, the phase transition, 
and the giant component 

Although it is not usually possible to find a closed-form 
expression for the complete distribution of cluster sizes 
on a graph, we can find closed-form expressions for the 
average properties of clusters from Eqs. (26) and (27). 
For example, the average size of the component to which 
a randomly chosen vertex belongs, for the case where 
there is no giant component in the graph, is given in the 
normal fashion by 

\[ \langle s \rangle = H_0'(1) = 1 + G_0'(1)\, H_1'(1). \tag{29} \]

From Eq. (26) we have 

\[ H_1'(1) = 1 + G_1'(1)\, H_1'(1), \tag{30} \]

and hence 

\[ \langle s \rangle = 1 + \frac{G_0'(1)}{1 - G_1'(1)} = 1 + \frac{z_1^2}{z_1 - z_2}, \tag{31} \]

where $z_1 = z$ is the average number of neighbors of a 
vertex and $z_2$ is the average number of second neighbors. 
We see that this expression diverges when 

\[ G_1'(1) = 1. \tag{32} \]

This point marks the phase transition at which a giant 
component first appears. Substituting Eqs. (2) and (9) 
into Eq. (32), we can also write the condition for the 
phase transition as 

\[ \sum_k k(k-2)\, p_k = 0. \tag{33} \]

Indeed, since this sum increases monotonically as edges 
are added to the graph, it follows that the giant component 
exists if and only if this sum is positive. This result 
has been derived by different means by Molloy and 
Reed. An equivalent and intuitively reasonable statement, 
which can also be derived from Eq. (31), is that 
the giant component exists if and only if $z_2 > z_1$. 
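
The step from the divergence condition to the sum over degrees is short enough to write out. Using the definition of the edge-following generating function as the normalized derivative of the vertex generating function, and writing the derivatives at x = 1 as sums over the degree distribution:

```latex
G_1'(1) = \frac{G_0''(1)}{G_0'(1)}
        = \frac{\sum_k k(k-1)\,p_k}{\sum_k k\,p_k} = 1
\quad\Longrightarrow\quad
\sum_k k(k-1)\,p_k - \sum_k k\,p_k = \sum_k k(k-2)\,p_k = 0 .
```

The same sum equals $z_2 - z_1$, which is why the condition can be restated as a comparison of the numbers of second and first neighbors.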

Our generating function formalism still works when 
there is a giant component in the graph, but, by definition, 
$H_0(x)$ then generates the probability distribution 
of the sizes of components excluding the giant component. 
This means that $H_0(1)$ is no longer unity, as it is 
for the other generating functions considered so far, but 
instead takes the value 1 - S, where S is the fraction of 
the graph occupied by the giant component. We can use 
this to calculate the size of the giant component from 
Eqs. (26) and (27) thus: 

\[ S = 1 - G_0(u), \tag{34} \]

where $u = H_1(1)$ is the smallest non-negative real solution 
of 

\[ u = G_1(u). \tag{35} \]

This result has been derived in a different but equivalent 
form by Molloy and Reed [40], using different methods. 
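
A minimal numerical sketch of this calculation for a finite degree distribution follows. It assumes simple fixed-point iteration started from zero, which climbs monotonically to the smallest non-negative solution because the edge-following generating function is non-decreasing on the unit interval; the function name, tolerance, and iteration cap are illustrative choices.

```python
def giant_component_fraction(p, tol=1e-12, max_iter=100_000):
    """Estimate S = 1 - G0(u), with u the smallest non-negative root of u = G1(u)."""
    z = sum(k * pk for k, pk in enumerate(p))

    def g0(x):
        return sum(pk * x**k for k, pk in enumerate(p))

    def g1(x):
        return sum(k * pk * x**(k - 1) for k, pk in enumerate(p) if k > 0) / z

    u = 0.0                      # starting below the root selects the smallest solution
    for _ in range(max_iter):
        new_u = g1(u)
        if abs(new_u - u) < tol:
            break
        u = new_u
    return 1.0 - g0(u)

# Degree counts of the 1000-person community example from Section II B
counts = [86, 150, 363, 238, 109, 54]
p_community = [n / sum(counts) for n in counts]
s_community = giant_component_fraction(p_community)
s_subcritical = giant_component_fraction([0.25, 0.5, 0.25])   # hypothetical case
```

Convergence slows markedly close to the phase transition, so the iteration cap matters there; away from the transition a few dozen iterations suffice.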

The correct general expression for the average component 
size, excluding the (formally infinite) giant component, 
if there is one, is 

\[ \begin{aligned} \langle s \rangle &= \frac{H_0'(1)}{H_0(1)} = \frac{1}{H_0(1)} \left[ G_0(H_1(1)) + \frac{G_0'(H_1(1))\, G_1(H_1(1))}{1 - G_1'(H_1(1))} \right] \\ &= 1 + \frac{z u^2}{[1 - S][1 - G_1'(u)]}, \end{aligned} \tag{36} \]

which is equivalent to (31) when there is no giant component 
(S = 0, u = 1). 

For example, in the ordinary random graph with Poisson 
degree distribution, we have $G_0(x) = G_1(x) = e^{z(x-1)}$ 
(Eq. (12)), and hence we find simply that 1 - S = u 
is a solution of $u = G_0(u)$, or equivalently that 

\[ S = 1 - e^{-zS}. \tag{37} \]

The average component size is given by 

\[ \langle s \rangle = \frac{1}{1 - z + zS}. \tag{38} \]

These are all well-known results [1]. 
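
For the Poisson case both results are easy to evaluate numerically. The sketch below solves the self-consistency condition for S by fixed-point iteration started from S = 1, then inserts the result into the expression for the average component size; the function names and the sample mean degrees are assumptions for illustration.

```python
import math

def poisson_giant_component(z, tol=1e-12, max_iter=100_000):
    """Largest solution S in [0, 1] of S = 1 - exp(-z S), by iteration from S = 1."""
    s = 1.0
    for _ in range(max_iter):
        new_s = 1.0 - math.exp(-z * s)
        if abs(new_s - s) < tol:
            return new_s
        s = new_s
    return s

def poisson_mean_component_size(z):
    """Average size of the non-giant components, 1 / (1 - z + z S)."""
    s = poisson_giant_component(z)
    return 1.0 / (1.0 - z + z * s)

s_above = poisson_giant_component(2.0)       # mean degree above the transition at z = 1
s_below = poisson_giant_component(0.5)       # mean degree below the transition
size_below = poisson_mean_component_size(0.5)
```

Below the transition the iteration collapses to S = 0 and the mean component size reduces to 1/(1 - z); exactly at z = 1 the iteration converges only very slowly.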
For graphs with purely power-law distributions 
(Eq. (16) with $\kappa \to \infty$), S is given by Eq. (34) with u the 
smallest non-negative real solution of 

\[ u = \frac{\mathrm{Li}_{\tau-1}(u)}{u\, \zeta(\tau - 1)}. \tag{39} \]

For all $\tau \le 2$ this gives u = 0, and hence S = 1, implying 
that a randomly chosen vertex belongs to the giant 
component with probability tending to 1 as $\kappa \to \infty$. For 
graphs with $\tau > 2$, the probability of belonging to the giant 
component is strictly less than 1, even for infinite $\kappa$. 
In other words, the giant component essentially fills the 
entire graph for $\tau \le 2$, but not for $\tau > 2$. These results 
have been derived by different means by Aiello et al. [13]. 



E. Asymptotic form of the cluster size distribution 

A variety of results are known about the asymptotic 
properties of the coefficients of generating functions, 
some of which can usefully be applied to the distribution 
of cluster sizes $P_s$ generated by $H_0(x)$. Close to the 
phase transition, we expect the tail of the distribution $P_s$ 
to behave as 

\[ P_s \sim s^{-\alpha} e^{-s/s^*}, \tag{40} \]

where the constants $\alpha$ and $s^*$ can be calculated from the 
properties of $H_0(x)$ as follows. 

The cutoff parameter $s^*$ is simply related to the radius 
of convergence $|x^*|$ of the generating function according to 

\[ s^* = \frac{1}{\log |x^*|}. \tag{41} \]

The radius of convergence $|x^*|$ is equal to the magnitude 
of the position $x^*$ of the singularity in $H_0(x)$ nearest to 
the origin. From Eq. (27) we see that such a singularity 
may arise either through a singularity in $G_0(x)$ or 
through one in $H_1(x)$. However, since the first singularity 
in $G_0(x)$ is known to be outside the unit circle 
(Section II A), and the first singularity in $H_1(x)$ tends 
to x = 1 as we go to the phase transition (see below), it 
follows that, sufficiently close to the phase transition, the 
singularity in $H_0(x)$ closest to the origin is also a singularity 
in $H_1(x)$. With this result $x^*$ is easily calculated. 

Although we do not in general have a closed-form expression 
for $H_1(x)$, it is easy to derive one for its functional 
inverse. Putting $w = H_1(x)$ and $x = H_1^{-1}(w)$ in 
Eq. (26) and rearranging, we find 

\[ x = H_1^{-1}(w) = \frac{w}{G_1(w)}. \tag{42} \]

The singularity of interest corresponds to the point $w^*$ 
at which the derivative of $H_1^{-1}(w)$ is zero, which is a 
solution of 

\[ G_1(w^*) - w^* G_1'(w^*) = 0. \tag{43} \]

Then $x^*$ (and hence $s^*$) is given by Eq. (42). Note that 
there is no guarantee that (43) has a finite solution, and 
that if it does not, then $P_s$ will not in general follow the 
form of Eq. (40). 

When we are precisely at the phase transition of our 
system, we have $G_1(1) = G_1'(1) = 1$, and hence the solution 
of Eq. (43) gives $w^* = x^* = 1$ (a result which 
we used above) and $s^* \to \infty$. We can use the fact that 
$x^* = 1$ at the transition to calculate the value of the exponent 
$\alpha$ as follows. Expanding $H_1^{-1}(w)$ about $w^* = 1$ 
by putting $w = 1 + \epsilon$ in Eq. (42), we find that 

\[ H_1^{-1}(1 + \epsilon) = 1 - \tfrac{1}{2} G_1''(1)\, \epsilon^2 + O(\epsilon^3), \tag{44} \]

where we have made use of $G_1(1) = G_1'(1) = 1$ at the 
phase transition. So long as $G_1''(1) \ne 0$, which in general 
it is not, this implies that $H_1(x)$ and hence also $H_0(x)$ 
are of the form 

\[ H_0(x) \sim (1 - x)^{\beta} \quad \text{as } x \to 1, \tag{45} \]

with $\beta = \tfrac{1}{2}$. This exponent is related to the exponent 
$\alpha$ as follows. Equation (40) implies that $H_0(x)$ can be 
written in the form 

\[ H_0(x) = \sum_{s=0}^{a-1} P_s x^s + C \sum_{s=a}^{\infty} s^{-\alpha} e^{-s/s^*} x^s + \epsilon(a), \tag{46} \]

where C is a constant and the last (error) term $\epsilon(a)$ is assumed 
much smaller than the second term. The first term 
in this expression is a finite polynomial and therefore has 
no singularities on the finite plane; the singularity resides 
in the second term. Using this equation, the exponent $\beta$ 
can be written: 

\[ \begin{aligned} \beta &= \lim_{x \to 1} \left[ 1 + (x - 1) \frac{H_0''(x)}{H_0'(x)} \right] \\ &= \lim_{a \to \infty} \lim_{x \to 1} \left[ \frac{1}{x} + \frac{x - 1}{x} \, \frac{\sum_{s=a}^{\infty} s^{2-\alpha} x^{s-1}}{\sum_{s=a}^{\infty} s^{1-\alpha} x^{s-1}} \right] \\ &= \lim_{a \to \infty} \lim_{x \to 1} \left[ \frac{1}{x} + \frac{1 - x}{x \log x} \, \frac{\Gamma(3 - \alpha, -a \log x)}{\Gamma(2 - \alpha, -a \log x)} \right] \\ &= \alpha - 1, \end{aligned} \tag{47} \]

where we have replaced the sums with integrals as a becomes 
large, and $\Gamma(\nu, \mu)$ is the incomplete $\Gamma$-function. 
Taking the limits in the order specified and rearranging 
for $\alpha$, we then get 

\[ \alpha = \beta + 1 = \tfrac{3}{2}, \tag{48} \]

regardless of degree distribution, except in the special 
case where $G_1''(1)$ vanishes (see Eq. (44)). The result 
$\alpha = \tfrac{3}{2}$ was known previously for the ordinary Poisson 
random graph [1], but not for other degree distributions. 
F. Numbers of neighbors and average path length 

We turn now to the calculation of the number of neighbors who are $m$ steps away from a randomly chosen vertex. As shown in Section II A, the probability distributions for first- and second-nearest neighbors are generated by the functions $G_0(x)$ and $G_0(G_1(x))$. By extension, the distribution of $m$th neighbors is generated by $G_0(G_1(\ldots G_1(x) \ldots))$, with $m - 1$ iterations of the function $G_1$ acting on itself. If we define $G^{(m)}(x)$ to be this generating function for $m$th neighbors, then we have

$$G^{(m)}(x) = \begin{cases} G_0(x) & \text{for } m = 1, \\ G^{(m-1)}(G_1(x)) & \text{for } m \ge 2. \end{cases} \tag{49}$$

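Then the average number $z_m$ of $m$th-nearest neighbors is

$$z_m = \left.\frac{dG^{(m)}}{dx}\right|_{x=1} = G_1'(1)\,G^{(m-1)\prime}(1) = G_1'(1)\,z_{m-1}. \tag{50}$$

Along with the initial condition $z_1 = z = G_0'(1)$, this then tells us that

$$z_m = [G_1'(1)]^{m-1}\, G_0'(1) = \left[\frac{z_2}{z_1}\right]^{m-1} z_1, \tag{51}$$

where the second equality uses $z_2 = G_0'(1)\,G_1'(1)$.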



From this result we can make an estimate of the typical length $\ell$ of the shortest path between two randomly chosen vertices on the graph. This typical path length is reached approximately when the total number of neighbors of a vertex out to that distance is equal to the number of vertices on the graph, i.e., when

$$1 + \sum_{m=1}^{\ell} z_m = N. \tag{52}$$

Using Eq. (51) this gives us

$$\ell = \frac{\log[(N-1)(z_2 - z_1) + z_1^2] - \log z_1^2}{\log(z_2/z_1)}. \tag{53}$$

In the common case where $N \gg z_1$ and $z_2 \gg z_1$, this reduces to

$$\ell = \frac{\log(N/z_1)}{\log(z_2/z_1)} + 1. \tag{54}$$



This result is only approximate for two reasons. First, the conditions used to derive it are only an approximation; the exact answer depends on the detailed structure of the graph. Second, it assumes that all vertices are reachable from a randomly chosen starting vertex. In general, however, this will not be true. For graphs with no giant component it is certainly not true and Eq. (54) is meaningless. Even when there is a giant component, however, it is usually not the case that it fills the entire graph. A better approximation to $\ell$ may therefore be given by replacing $N$ in Eq. (54) by $NS$, where $S$ is the fraction of the graph occupied by the giant component, as in Section II E.
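
As an illustration, the estimates of Eqs. (51), (53), and (54), including the substitution of $NS$ for $N$, can be evaluated numerically. The following is a minimal Python sketch and not part of the original analysis; the function names and the list representation `pk` of the degree distribution are assumptions made here for illustration.

```python
import math

def neighbor_moments(pk):
    """Return (z1, z2) for a degree distribution pk[k] = P(degree == k)."""
    z1 = sum(k * p for k, p in enumerate(pk))            # z1 = G_0'(1)
    z2 = sum(k * (k - 1) * p for k, p in enumerate(pk))  # z2 = G_0'(1) G_1'(1) = G_0''(1)
    return z1, z2

def mean_mth_neighbors(z1, z2, m):
    """Average number of m-th nearest neighbors, Eq. (51)."""
    return (z2 / z1) ** (m - 1) * z1

def path_length(n, z1, z2, s=1.0, exact=False):
    """Typical path length from Eq. (54), or from Eq. (53) if exact=True.

    s is the fraction of the graph in the giant component; n is replaced
    by n * s as suggested in the text.
    """
    if z2 <= z1:
        raise ValueError("the estimate needs z2 > z1 (giant component present)")
    n_eff = n * s
    if exact:
        numerator = math.log((n_eff - 1) * (z2 - z1) + z1 ** 2) - math.log(z1 ** 2)
        return numerator / math.log(z2 / z1)
    return math.log(n_eff / z1) / math.log(z2 / z1) + 1
```

For a Poisson graph with $z_1 = z$ and $z_2 = z^2$, `path_length(n, z, z * z)` reduces to $\log N / \log z$, as expected.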



Such shortcomings notwithstanding, there are a number of remarkable features of Eq. (54):

1. It shows that the average vertex-vertex distance for all random graphs, regardless of degree distribution, should scale logarithmically with size $N$, according to $\ell = A + B \log N$, where $A$ and $B$ are constants. This result is of course well-known for a number of special cases.

2. It shows that the average distance, which is a global property, can be calculated from a knowledge only of the average numbers of first- and second-nearest neighbors, which are local properties. It would be possible therefore to measure these numbers empirically by purely local measurements on a graph such as an acquaintance network and from them to determine the expected average distance between vertices. For some networks at least, this gives a surprisingly good estimate of the true average distance [?].

3. It shows that only the average numbers of first- and second-nearest neighbors are important to the calculation of average distances, and thus that two random graphs with completely different distributions of vertex degrees, but the same values of $z_1$ and $z_2$, will have the same average distances.

For the case of the purely theoretical example graphs we discussed earlier, we cannot make an empirical measurement of $z_1$ and $z_2$, but we can still employ Eq. (54) to calculate $\ell$. In the case of the ordinary (Poisson) random graph, for instance, we find from Eq. (12) that $z_1 = z$, $z_2 = z^2$, and so $\ell = \log N / \log z$, which is the standard result for graphs of this type. For the graph with degree distributed according to the truncated power law, Eq. (17), $z_1$ and $z_2$ follow from the expressions given earlier for this distribution, and the average vertex-vertex distance is

$$\ell = \frac{\log N + \log\left[\mathrm{Li}_{\tau}(e^{-1/\kappa}) / \mathrm{Li}_{\tau-1}(e^{-1/\kappa})\right]}{\log\left[\mathrm{Li}_{\tau-2}(e^{-1/\kappa}) / \mathrm{Li}_{\tau-1}(e^{-1/\kappa}) - 1\right]} + 1. \tag{55}$$

In the limit $\kappa \to \infty$, this becomes

$$\ell = \frac{\log N + \log[\zeta(\tau)/\zeta(\tau-1)]}{\log[\zeta(\tau-2)/\zeta(\tau-1) - 1]} + 1. \tag{56}$$

Note that this expression does not have a finite positive real value for any $\tau < 3$, indicating that one must specify a finite cutoff $\kappa$ for the degree distribution to get a well-defined average vertex-vertex distance on such graphs.
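
Equation (55) is straightforward to evaluate numerically. The sketch below is an illustration only: it sums the polylogarithm series $\mathrm{Li}_s(x) = \sum_{k \ge 1} x^k / k^s$ directly, which is adequate for moderate cutoffs $\kappa$ (the series converges slowly as $\kappa$ grows), and the function names and the default number of terms are assumptions made here.

```python
import math

def polylog(s, x, terms=100000):
    """Li_s(x) = sum_{k>=1} x**k / k**s by direct summation; needs 0 <= x < 1."""
    total = 0.0
    xk = 1.0
    for k in range(1, terms + 1):
        xk *= x
        if xk == 0.0:          # remaining terms underflow to zero
            break
        total += xk / k ** s
    return total

def path_length_power_law(n, tau, kappa):
    """Evaluate Eq. (55) for the power law with exponential cutoff."""
    x = math.exp(-1.0 / kappa)
    li_tau = polylog(tau, x)
    li_tau_1 = polylog(tau - 1, x)
    li_tau_2 = polylog(tau - 2, x)
    ratio = li_tau_2 / li_tau_1 - 1.0          # this is z2 / z1
    if ratio <= 1.0:
        raise ValueError("z2 <= z1: the estimate is not meaningful here")
    return (math.log(n) + math.log(li_tau / li_tau_1)) / math.log(ratio) + 1.0
```

The result agrees with Eq. (54) when $z_1$ and $z_2$ are computed directly as moments of the same degree distribution.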



G. Simulation results 

As a check on the results of this section, we have per- 
formed extensive computer simulations of random graphs 
with various distributions of vertex degree. Such graphs are relatively straightforward to generate. First, we generate a set of $N$ random numbers $\{k_i\}$ to represent the degrees of the $N$ vertices in the graph. These may be thought of as the "stubs" of edges, emerging from their respective vertices. Then we choose pairs of these stubs at random and place edges on the graph joining them up. It is simple to see that this will generate all graphs with the given set of vertex degrees with equal probability. The only small catch is that the sum $\sum_i k_i$ of the degrees must be even, since each edge added to the graph must have two ends. This is not difficult to contrive, however. If the set $\{k_i\}$ is such that the sum is odd, we simply throw it away and generate a new set.
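
The stub-pairing procedure just described can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the simulations reported here; the function names and the `draw_degree` callback are hypothetical, and self-loops and repeated edges are left in place.

```python
import random

def pair_stubs(degrees, rng=random):
    """Join stubs uniformly at random; returns a list of edges (u, v).

    Self-loops and repeated edges are not removed in this sketch.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the sum of the degrees must be even")
    # One entry per stub: vertex v appears degrees[v] times.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive entries of the shuffled list form the edges.
    return list(zip(stubs[0::2], stubs[1::2]))

def random_graph(n, draw_degree, rng=random):
    """Draw n degrees, discarding the whole set until its sum is even."""
    while True:
        degrees = [draw_degree(rng) for _ in range(n)]
        if sum(degrees) % 2 == 0:
            return degrees, pair_stubs(degrees, rng)
```

Shuffling the full list of stubs and pairing consecutive entries is one simple way to choose pairs of stubs uniformly at random.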

As a practical matter, integers representing vertex degrees with any desired probability distribution can be generated using the transformation method if applicable, or failing that, a rejection or hybrid method [?]. For example, degrees obeying the power-law-plus-cutoff form of Eq. (17) can be generated using a two-step hybrid transformation/rejection method as follows. First, we generate random integers $k \ge 1$ with distribution proportional to $e^{-k/\kappa}$ using the transformation [?]

$$k = \lceil -\kappa \log(1 - r) \rceil, \tag{57}$$

where $r$ is a random real number uniformly distributed in the range $0 \le r < 1$. Second, we accept this number with probability $k^{-\tau}$, where by "accept" we mean that if the number is not accepted we discard it and generate another one according to Eq. (57), repeating the process until one is accepted.
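
A minimal Python sketch of this two-step hybrid method is given below, for illustration only. The function name is hypothetical; the redraw for $k = 0$ handles the single value $r = 0$, for which Eq. (57) would return zero.

```python
import math
import random

def draw_cutoff_power_law(tau, kappa, rng=random):
    """Draw k >= 1 with probability proportional to k**(-tau) * exp(-k/kappa)."""
    while True:
        r = rng.random()                           # uniform in [0, 1)
        k = math.ceil(-kappa * math.log(1.0 - r))  # Eq. (57): P(k) ~ exp(-k/kappa)
        if k < 1:
            continue                               # r == 0.0 gives k = 0; redraw
        if rng.random() < k ** (-tau):             # accept with probability k**(-tau)
            return k
```

Because the acceptance probability $k^{-\tau}$ never exceeds one for $k \ge 1$, the accepted values follow the product of the exponential proposal and the power-law factor.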

In Fig. 4 we show results for the size of the giant component in simulations of undirected unipartite graphs with vertex degrees distributed according to Eq. (17) for a variety of different values of $\tau$ and $\kappa$. On the same plot we also show the expected value of the same quantity derived by numerical solution of Eqs. (34) and (35). As the figure shows, the agreement between simulation and theory is excellent.



III. DIRECTED GRAPHS 

We turn now to directed graphs with arbitrary degree distributions. An example of a directed graph is the world-wide web, since every hyperlink between two pages on the web goes in only one direction. The web has a degree distribution that follows a power law, as discussed in Section I.

Directed graphs introduce a subtlety that is not present in undirected ones, and which becomes important when we apply our generating function formalism. In a directed graph it is not possible to talk about a "component" (i.e., a group of connected vertices), because even if vertex A can be reached by following (directed) edges from vertex B, that does not necessarily mean that vertex B can be reached from vertex A. There are two correct generalizations of the idea of the component to a directed graph: the set of vertices which are reachable from a given vertex, and the set from which a given vertex can be reached. We will refer to these as "out-components" and "in-components" respectively. An in-component can also be thought of as those vertices reachable by following edges backwards (but not forwards) from a specified vertex. It is possible to study directed graphs by allowing both forward and backward traversal of edges (see Ref. [?], for example). In this case, however, the graph effectively becomes undirected and should be treated with the formalism of Section II.

With these considerations in mind, we now develop the generating function formalism appropriate to random directed graphs with arbitrary degree distributions.

FIG. 4. The size of the giant component in random graphs with vertex degrees distributed according to Eq. (17), as a function of the cutoff parameter $\kappa$ for five different values of the exponent $\tau$. The points are results from numerical simulations on graphs of $N = 1\,000\,000$ vertices, and the solid lines are the theoretical value for infinite graphs, Eqs. (34) and (35). The error bars on the simulation results are smaller than the data points.



A. Generating functions 

In a directed graph, each vertex has separate in-degree and out-degree for links running into and out of that vertex. Let us define $p_{jk}$ to be the probability that a randomly chosen vertex has in-degree $j$ and out-degree $k$. It is important to realize that in general this joint distribution of $j$ and $k$ is not equal to the product $p_j p_k$ of the separate distributions of in- and out-degree. In the world-wide web, for example, it seems likely (although this question has not been investigated to our knowledge) that sites with a large number of outgoing links also have a large number of incoming ones, i.e., that $j$ and $k$ are correlated, so that $p_{jk} \neq p_j p_k$. We appeal to those working on studies of the structure of the web to measure the joint distribution of in- and out-degrees of sites; empirical data on this distribution would make theoretical work much easier!

We now define a generating function for the joint probability distribution of in- and out-degrees, which is necessarily a function of two independent variables, $x$ and $y$, thus:

$$\mathcal{G}(x, y) = \sum_{jk} p_{jk}\, x^j y^k. \tag{58}$$



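Since every edge on a directed graph must leave some vertex and enter another, the net average number of edges entering a vertex is zero, and hence $p_{jk}$ must satisfy the constraint

$$\sum_{jk} (j - k)\, p_{jk} = 0. \tag{59}$$

This implies that $\mathcal{G}(x, y)$ must satisfy

$$\left.\frac{\partial \mathcal{G}}{\partial x}\right|_{x,y=1} = \left.\frac{\partial \mathcal{G}}{\partial y}\right|_{x,y=1} = z, \tag{60}$$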



where z is the average degree (both in and out) of vertices 
in the graph. 

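Using the function $\mathcal{G}(x, y)$, we can, as before, define generating functions $G_0$ and $G_1$ for the number of outgoing edges leaving a randomly chosen vertex, and the number leaving the vertex reached by following a randomly chosen edge. We can also define generating functions $F_0$ and $F_1$ for the number arriving at such a vertex. These functions are given by

$$G_0(y) = \mathcal{G}(1, y), \qquad G_1(y) = \frac{1}{z}\left.\frac{\partial \mathcal{G}}{\partial x}\right|_{x=1}, \tag{61}$$

$$F_0(x) = \mathcal{G}(x, 1), \qquad F_1(x) = \frac{1}{z}\left.\frac{\partial \mathcal{G}}{\partial y}\right|_{y=1}. \tag{62}$$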



Once we have these functions, many results follow as before. The average numbers of first and second neighbors reachable from a randomly chosen vertex are given by Eq. (60) and

$$z_2 = G_0'(1)\, G_1'(1) = \left.\frac{\partial^2 \mathcal{G}}{\partial x\, \partial y}\right|_{x,y=1}. \tag{63}$$
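
To make Eqs. (58), (60), and (63) concrete, the sketch below evaluates them for a joint distribution stored as a Python dictionary. The dictionary representation and the function names are assumptions made for illustration.

```python
def joint_generating_function(pjk, x, y):
    """Eq. (58): G(x, y) = sum over (j, k) of p_jk * x**j * y**k."""
    return sum(p * x ** j * y ** k for (j, k), p in pjk.items())

def directed_moments(pjk):
    """pjk maps (in-degree j, out-degree k) to its probability.

    Returns (z_in, z_out, z2): the two derivatives in Eq. (60), which must
    be equal for a valid distribution, and the mixed derivative of Eq. (63).
    """
    z_in = sum(j * p for (j, k), p in pjk.items())     # dG/dx at x = y = 1
    z_out = sum(k * p for (j, k), p in pjk.items())    # dG/dy at x = y = 1
    z2 = sum(j * k * p for (j, k), p in pjk.items())   # d2G/dxdy at x = y = 1
    return z_in, z_out, z2
```

If `z_in` and `z_out` differ, the distribution violates the constraint of Eq. (59) and cannot describe a directed graph.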



These are also the numbers of first and second neighbors from which a random vertex can be reached, since Eqs. (60) and (63) are manifestly symmetric in $x$ and $y$. We can also make an estimate of the average path length on the graph from

$$\ell = \frac{\log(N/z_1)}{\log(z_2/z_1)} + 1, \tag{64}$$

as before. However, this equation should be used with caution. As discussed in Section II F, the derivation of this formula assumes that we are in a regime in which the bulk of the graph is reachable from most vertices. On a directed graph, however, this may be far from true, as appears to be the case with the world-wide web [26].

FIG. 5. The "bow-tie" diagram proposed by Broder et al. as a representation of the giant component of the world-wide web (although it can be used to visualize any directed graph).

The probability distribution of the numbers of vertices reachable from a randomly chosen vertex in a directed graph (i.e., of the sizes of the out-components) is generated by the function $H_0(y) = y\,G_0(H_1(y))$, where $H_1(y)$ is a solution of $H_1(y) = y\,G_1(H_1(y))$, just as before. (A similar and obvious pair of equations governs the sizes of the in-components.) The results for the asymptotic behavior of the component size distribution from Section II E generalize straightforwardly to directed graphs.



The average out-component size for the case where there is no giant component is given by Eq. (31), and thus the point at which a giant component first appears is given once more by $G_1'(1) = 1$. Substituting Eq. (58) into this expression gives the explicit condition

$$\sum_{jk} (2jk - j - k)\, p_{jk} = 0 \tag{65}$$

for the first appearance of the giant component. This expression is the equivalent for the directed graph of Eq. (33). It is also possible, and equally valid, to define the position at which the giant component appears by $F_1'(1) = 1$, which provides an alternative derivation for Eq. (65).
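
The condition of Eq. (65) is easy to check numerically for any given joint distribution. The sketch below is illustrative: the function names are hypothetical, and the test distribution (independent, identically exponentially distributed in- and out-degrees, truncated at `kmax`) is an assumption chosen as an example. A positive value of the sum corresponds to $G_1'(1) > 1$, i.e., to the presence of a giant component.

```python
import math

def giant_component_condition(pjk):
    """Left-hand side of Eq. (65): sum over (j, k) of (2jk - j - k) * p_jk."""
    return sum((2 * j * k - j - k) * p for (j, k), p in pjk.items())

def independent_exponential(kappa, kmax=200):
    """Example distribution: p_jk = p_j * p_k with
    p_k = (1 - exp(-1/kappa)) * exp(-k/kappa), truncated at kmax."""
    c = 1.0 - math.exp(-1.0 / kappa)
    p = [c * math.exp(-k / kappa) for k in range(kmax + 1)]
    return {(j, k): p[j] * p[k] for j in range(kmax + 1) for k in range(kmax + 1)}
```

For this example the sum changes sign at $\kappa = 1/\log 2$, where the mean degree equals one.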

Just as with the individual in- and out-components for vertices, the size of the giant component on a directed graph can also be defined in different ways. The giant component can be represented using the "bow-tie" diagram of Broder et al. [26], which we depict (in simplified form) in Fig. 5. The diagram has three parts. The strongly connected portion of the giant component, represented by the central circle, is that portion in which every vertex can be reached from every other. The two sides of the bow-tie represent (1) those vertices from which the strongly connected component can be reached but which it is not possible to reach from the strongly connected component and (2) those vertices which can be reached from the strongly connected component but from which it is not possible to reach the strongly connected component. The solution of Eqs. (34) and (35) with $G_0(x)$ and $G_1(x)$ defined according to Eq. (61) gives the number of vertices, as a fraction of $N$, in the giant strongly connected component plus those vertices from which the giant strongly connected component can be reached. Using $F_0(x)$ and $F_1(x)$ (Eq. (62)) in place of $G_0(x)$ and $G_1(x)$ gives a different solution, which represents the fraction of the graph in the giant strongly connected component plus those vertices which can be reached from it.

FIG. 6. The distribution $P_s$ of the numbers of vertices accessible from each vertex of a directed graph with identically exponentially distributed in- and out-degree. The points are simulation results for systems of $N = 1\,000\,000$ vertices and the solid lines are the analytic solution.



B. Simulation results 

We have performed simulations of directed graphs as a check on the results above. Generation of random directed graphs with known joint degree distribution $p_{jk}$ is somewhat more complicated than generation of the undirected graphs discussed in Section II G. The method we use is as follows. First, it is important to ensure that the averages of the distributions of in- and out-degree of the graph are the same, or equivalently that $p_{jk}$ satisfies Eq. (59). If this is not the case, at least to good approximation, then generation of the graph will be impossible. Next, we generate a set of $N$ in/out-degree pairs $(j_i, k_i)$, one for each vertex $i$, according to the joint distribution $p_{jk}$, and calculate the sums $\sum_i j_i$ and $\sum_i k_i$. These sums are required to be equal if there are to be no dangling edges in the graph, but in most cases we find that they are not. To rectify this we use a simple procedure. We choose a vertex $i$ at random, discard the numbers $(j_i, k_i)$ for that vertex and generate new ones from the distribution $p_{jk}$. We repeat this procedure until the two sums are found to be equal. Finally, we choose random in/out pairs of edges and join them together to make a directed graph. The resulting graph has the desired number of vertices and the desired joint distribution of in- and out-degree.
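
The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the simulation code itself: the function name and the `draw_pair` callback (which should return one $(j, k)$ pair drawn from $p_{jk}$) are hypothetical, and the loop terminates in reasonable time only if the in- and out-degree distributions have equal means.

```python
import random

def directed_random_graph(n, draw_pair, rng=random):
    """Generate degree pairs with equal totals, then join out-stubs to in-stubs.

    Returns (pairs, edges), where pairs[i] = (j_i, k_i) and each edge (u, v)
    runs from vertex u to vertex v.
    """
    pairs = [draw_pair(rng) for _ in range(n)]
    total_in = sum(j for j, k in pairs)
    total_out = sum(k for j, k in pairs)
    while total_in != total_out:
        i = rng.randrange(n)               # choose a vertex at random
        j_old, k_old = pairs[i]
        pairs[i] = draw_pair(rng)          # redraw its in/out-degree pair
        total_in += pairs[i][0] - j_old
        total_out += pairs[i][1] - k_old
    in_stubs = [v for v, (j, k) in enumerate(pairs) for _ in range(j)]
    out_stubs = [v for v, (j, k) in enumerate(pairs) for _ in range(k)]
    rng.shuffle(in_stubs)                  # random matching of out- to in-stubs
    return pairs, list(zip(out_stubs, in_stubs))
```

Shuffling only one of the two stub lists is enough to produce a uniformly random matching between them.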

We have simulated directed graphs in which the distribution $p_{jk}$ is given by a simple product of independent distributions of in- and out-degree. (As pointed out in Section III A, this is not generally the case for real-world directed graphs, where in- and out-degree may be correlated.) In Fig. 6 we show results from simulations of graphs with identically distributed (but independent) in- and out-degrees drawn from the exponential distribution, Eq. (13). For this distribution, solution of the critical-point equation $G_1'(1) = 1$ shows that the giant component first appears at $\kappa_c = [\log 2]^{-1} = 1.4427$. The three curves in the figure show the distribution of numbers of vertices accessible from each vertex in the graph for $\kappa = 0.5$, $0.8$, and $\kappa_c$. The critical distribution follows a power-law form (see Section II C), while the others show an exponential cutoff. We also show the exact distribution derived from the coefficients in the expansion of $H_1(x)$ about zero. Once again, theory and simulation are in good agreement. A fit to the distribution for the case $\kappa = \kappa_c$ gives a value of $\alpha = 1.50 \pm 0.02$, in good agreement with Eq. (48).



IV. BIPARTITE GRAPHS 

The collaboration graphs of scientists, company directors, and movie actors discussed in Section I are all examples of bipartite graphs. In this section we study the theory of bipartite graphs with arbitrary degree distributions. To be concrete, we will speak in the language of "actors" and "movies," but clearly all the developments here are applicable to academic collaborations, boards of directors, or any other bipartite graph structure.



A. Generating functions and basic results 

Consider then a bipartite graph of $M$ movies and $N$ actors, in which each actor has appeared in an average of $\mu$ movies and each movie has a cast of average size $\nu$ actors. Note that only three of these parameters are independent, since the fourth is given by the equality

$$\frac{\mu}{M} = \frac{\nu}{N}. \tag{66}$$



Let $p_j$ be the probability distribution of the degree of actors (i.e., of the number of movies in which they have appeared) and $q_k$ be the distribution of degree (i.e., cast size) of movies. We define two generating functions which generate these probability distributions thus:

$$f_0(x) = \sum_j p_j x^j, \qquad g_0(x) = \sum_k q_k x^k. \tag{67}$$

(It may be helpful to think of $f$ as standing for "film," in order to keep these two straight.) As before, we necessarily have

$$f_0(1) = g_0(1) = 1, \qquad f_0'(1) = \mu, \qquad g_0'(1) = \nu. \tag{68}$$

If we now choose a random edge on our bipartite graph and follow it both ways to reach the movie and actor which it connects, then the distribution of the number of other edges leaving those two vertices is generated by the equivalent of Eq. (9):

$$f_1(x) = \frac{1}{\mu} f_0'(x), \qquad g_1(x) = \frac{1}{\nu} g_0'(x). \tag{69}$$



Now we can write the generating function for the distribution of the number of co-stars (i.e., actors in shared movies) of a randomly chosen actor as

$$G_0(x) = f_0(g_1(x)). \tag{70}$$

If we choose a random edge, then the distribution of number of co-stars of the actor to which it leads is generated by

$$G_1(x) = f_1(g_1(x)). \tag{71}$$

These two functions play the same role in the one-mode network of actors as the functions of the same name did for the unipartite random graphs of Section II. Once we have calculated them, all the results from Section II follow exactly as before.

The numbers of first and second neighbors of a randomly chosen actor are

$$z_1 = G_0'(1) = f_0'(1)\, g_1'(1), \tag{72}$$

$$z_2 = G_0'(1)\, G_1'(1) = f_0'(1)\, f_1'(1)\, [g_1'(1)]^2. \tag{73}$$

Explicit expressions for these quantities can be obtained by substituting from Eqs. (67) and (69). The average vertex-vertex distance on the one-mode graph is given as before by Eq. (54). Thus, it is possible to estimate average distances on such graphs by measuring only the numbers of first and second neighbors.
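
As a sketch of that substitution, Eqs. (72) and (73) can be written in terms of the first two moments of $p_j$ and $q_k$. The Python below is illustrative only; the function names and the list representation of the two distributions are assumptions.

```python
def moment(dist, order):
    """Moment of the given order for dist[k] = P(degree == k), order >= 1."""
    return sum(k ** order * p for k, p in enumerate(dist))

def bipartite_neighbors(p, q):
    """p[j]: P(actor appears in j movies); q[k]: P(movie has k actors).

    Returns (z1, z2) for the one-mode network of actors.
    """
    mu1, mu2 = moment(p, 1), moment(p, 2)
    nu1, nu2 = moment(q, 1), moment(q, 2)
    f0_prime = mu1                       # f_0'(1) = mu
    f1_prime = (mu2 - mu1) / mu1         # f_1'(1) = f_0''(1) / mu
    g1_prime = (nu2 - nu1) / nu1         # g_1'(1) = g_0''(1) / nu
    z1 = f0_prime * g1_prime             # Eq. (72)
    z2 = f0_prime * f1_prime * g1_prime ** 2   # Eq. (73)
    return z1, z2
```

For Poisson-distributed $p_j$ and $q_k$ with means $\mu$ and $\nu$, this gives $z_1 = \mu\nu$ and $z_2 = (\mu\nu)^2$.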

The distribution of the sizes of the connected components in the one-mode network is generated by Eq. (27), where $H_1(x)$ is a solution of Eq. (26). The asymptotic results of Section II E generalize simply to the bipartite case, and the average size of a connected component in the absence of a giant component is

$$\langle s \rangle = 1 + \frac{G_0'(1)}{1 - G_1'(1)}, \tag{74}$$

as before. This diverges when $G_1'(1) = 1$, marking the first appearance of the giant component. Equivalently, the giant component first appears when

$$f_0''(1)\, g_0''(1) = f_0'(1)\, g_0'(1). \tag{75}$$

Substituting from Eq. (67), we then derive the explicit condition for the first appearance of the giant component:

$$\sum_{jk} jk\,(jk - j - k)\, p_j q_k = 0. \tag{76}$$

The size $S$ of the giant component, as a fraction of the total number $N$ of actors, is given as before by the solution of Eqs. (34) and (35).

Of course, all of these results work equally well if "actors" and "movies" are interchanged. One can calculate the average distance between movies in terms of common actors shared, the size and distribution of connected components of movies, and so forth, using the formulas given above, with only the exchange of $f_0$ and $f_1$ for $g_0$ and $g_1$. The formula (76) is, not surprisingly, invariant under this interchange, so that the position of the onset of the giant component is the same regardless of whether one is looking at actors or movies.



B. Clustering 

Watts and Strogatz [?] have introduced the concept of clustering in social networks, also sometimes called network transitivity. Clustering refers to the increased propensity of pairs of people to be acquainted with one another if they have another acquaintance in common. Watts and Strogatz defined a clustering coefficient which measures the degree of clustering on a graph. For our purposes, the definition of this coefficient is

$$C = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of vertices}} = \frac{3 N_\triangle}{N_3}. \tag{77}$$

Here "triangles" are trios of vertices each of which is connected to both of the others, and "connected triples" are trios in which at least one is connected to both the others. The factor of 3 in the numerator accounts for the fact that each triangle contributes to three connected triples of vertices, one for each of its three vertices. With this factor of 3, the value of $C$ lies strictly in the range from zero to one. In the directed and undirected unipartite random graphs of Sections II and III, $C$ is trivially zero in the limit $N \to \infty$. In the one-mode projections of bipartite graphs, however, both the actors and the movies can be expected to have non-zero clustering. We here treat the case for actors. The case for movies is easily derived by swapping $f$s and $g$s.

An actor who has $z \equiv z_1$ co-stars in total contributes $\tfrac{1}{2} z(z-1)$ connected triples to $N_3$, so that

$$N_3 = \tfrac{1}{2} N \sum_z z(z-1)\, r_z, \tag{78}$$

where $r_z$ is the probability of having $z$ co-stars. As shown above (Eq. (70)), the distribution $r_z$ is generated by $G_0(x)$ and so

$$N_3 = \tfrac{1}{2} N\, G_0''(1). \tag{79}$$



A movie which stars $k$ actors contributes $\tfrac{1}{6} k(k-1)(k-2)$ triangles to the total triangle count in the one-mode graph. Thus the total number of triangles on the graph is the sum of $\tfrac{1}{6} k(k-1)(k-2)$ over all movies, which is given by

$$N_\triangle = \tfrac{1}{6} M \sum_k k(k-1)(k-2)\, q_k = \tfrac{1}{6} M\, g_0'''(1). \tag{80}$$



Substituting into Eq. (77), we then get

$$C = \frac{M}{N}\, \frac{g_0'''(1)}{G_0''(1)}. \tag{81}$$

Making use of Eqs. (66), (67), and (70), this can also be written as

$$\frac{1}{C} - 1 = \frac{(\mu_2 - \mu_1)(\nu_2 - \nu_1)^2}{\mu_1 \nu_1 (2\nu_1 - 3\nu_2 + \nu_3)}, \tag{82}$$

where $\mu_n = \sum_k k^n p_k$ is the $n$th moment of the distribution of numbers of movies in which actors have appeared, and $\nu_n$ is the same for cast size (number of actors in a movie).
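
Equation (82) can be evaluated directly from the two degree distributions. The following Python sketch is for illustration; the function name and list representation are assumptions, and the factorial moments are computed term by term, using $\mu_2 - \mu_1 = \sum_j j(j-1) p_j$ and $2\nu_1 - 3\nu_2 + \nu_3 = \sum_k k(k-1)(k-2) q_k$. Returning zero when no movie has three or more actors is a convention adopted here.

```python
def clustering_coefficient(p, q):
    """Clustering coefficient of the one-mode actor network, Eq. (82).

    p[j]: P(actor appears in j movies); q[k]: P(movie has k actors).
    """
    mu1 = sum(j * pj for j, pj in enumerate(p))
    f2 = sum(j * (j - 1) * pj for j, pj in enumerate(p))            # mu_2 - mu_1
    nu1 = sum(k * qk for k, qk in enumerate(q))
    g2 = sum(k * (k - 1) * qk for k, qk in enumerate(q))            # nu_2 - nu_1
    g3 = sum(k * (k - 1) * (k - 2) * qk for k, qk in enumerate(q))  # 2 nu_1 - 3 nu_2 + nu_3
    if g3 == 0:
        return 0.0          # no triangles: no movie has three or more actors
    return 1.0 / (1.0 + f2 * g2 ** 2 / (mu1 * nu1 * g3))
```

As a check, if every actor appears in exactly one movie and every movie has exactly three actors, the one-mode network is a set of disjoint triangles and the function returns $C = 1$.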



C. Example

To give an example, consider a random bipartite graph with Poisson-distributed numbers of both movies per actor and actors per movie. In this case, following the derivation of Eq. (12), we find that

$$f_0(x) = e^{\mu(x-1)}, \qquad g_0(x) = e^{\nu(x-1)}, \tag{83}$$

and $f_1(x) = f_0(x)$ and $g_1(x) = g_0(x)$. Thus

$$G_0(x) = G_1(x) = e^{\mu(e^{\nu(x-1)} - 1)}. \tag{84}$$

This implies that $z_1 = \mu\nu$ and $z_2 = (\mu\nu)^2$, so that

$$\ell = \frac{\log N}{\log \mu\nu} = \frac{\log N}{\log z}, \tag{85}$$

just as in an ordinary Poisson-distributed random graph. From Eq. (74), the average size $\langle s \rangle$ of a connected component of actors, below the phase transition, is

$$\langle s \rangle = \frac{1}{1 - \mu\nu}, \tag{86}$$

which diverges, yielding a giant component, at $\mu\nu = z = 1$, also as in the ordinary random graph. From Eqs. (34) and (35), the size $S$ of the giant component as a fraction of $N$ is a solution of

$$S = 1 - e^{\mu(e^{-\nu S} - 1)}. \tag{87}$$

And from Eq. (81), the clustering coefficient for the one-mode network of actors is

$$C = \frac{1}{\mu + 1}, \tag{88}$$

where we have made use of Eq. (66).

Another quantity of interest is the distribution of numbers of co-stars, i.e., of the numbers of people with whom each actor has appeared in a movie. As discussed above, this distribution is generated by the function $G_0(x)$ defined in Eq. (70). For the case of the Poisson degree distribution, we can perform the derivatives, Eq. (4), and setting $x = 0$ we find that the probability $r_z$ of having appeared with a total of exactly $z$ co-stars is

$$r_z = \frac{\nu^z}{z!}\, e^{\mu(e^{-\nu} - 1)} \sum_{k=1}^{z} \left\{ {z \atop k} \right\} \left[\mu e^{-\nu}\right]^k, \tag{89}$$

where the coefficients $\left\{ {z \atop k} \right\}$ are the Stirling numbers of the second kind [?]

$$\left\{ {z \atop k} \right\} = \sum_{r=1}^{k} \frac{(-1)^{k-r}}{r!\,(k-r)!}\, r^z. \tag{90}$$



D. Simulation results



Random bipartite graphs can be generated using an algorithm similar to the one described in Section III B for directed graphs. After making sure that the required degree distributions for both actor and movie vertices have means consistent with the required total numbers of actors and movies according to Eq. (66), we generate vertex degrees for each actor and movie at random and calculate their sums. If these sums are unequal, we discard the degree of one actor and one movie, chosen at random, and replace them with new degrees drawn from the relevant distributions. We repeat this process until the total actor and movie degrees are equal. Then we join vertices up in pairs.

In Fig. 7 we show the results of such a simulation for a bipartite random graph with Poisson degree distribution. (In fact, for the particular case of the Poisson distribution, the graph can be generated simply by joining up actors and movies at random, without regard for individual vertex degrees.) The figure shows the distribution of the number of co-stars of each actor, along with the analytic solution, Eqs. (89) and (90). Once more, numerical and analytic results are in good agreement.
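
For reference, the analytic solution of Eq. (89) can be evaluated with a short routine such as the one sketched below. This is an illustration, not the code used for the figure: the function names are hypothetical, and the Stirling numbers are built with the standard recurrence $S(n, k) = k\,S(n-1, k) + S(n-1, k-1)$, which yields the same numbers as the explicit sum of Eq. (90). The case $z = 0$, for which the sum in Eq. (89) is empty, is handled separately using $r_0 = G_0(0)$. Floating-point range limits this sketch to moderate `zmax` (roughly below 150).

```python
import math

def next_stirling_row(row):
    """Given S(n-1, k) for k = 0..n-1, return S(n, k) for k = 0..n."""
    n = len(row)
    return [0] + [k * row[k] + row[k - 1] for k in range(1, n)] + [1]

def costar_distribution(mu, nu, zmax):
    """r_z for z = 0..zmax in the Poisson bipartite graph, Eq. (89)."""
    a = mu * math.exp(-nu)
    prefactor = math.exp(mu * (math.exp(-nu) - 1.0))
    r = [prefactor]              # r_0 = G_0(0)
    row = [1]                    # S(0, 0) = 1
    for z in range(1, zmax + 1):
        row = next_stirling_row(row)
        total = sum(s * a ** k for k, s in enumerate(row))
        r.append(nu ** z / math.factorial(z) * prefactor * total)
    return r
```

The returned probabilities should sum to one and have mean $\mu\nu$, which provides a simple consistency check before comparing against simulation data.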



V. APPLICATIONS TO REAL-WORLD NETWORKS

In this section we construct random graph models of two types of real-world networks, namely collaboration graphs and the world-wide web, using the results of Sections III and IV to incorporate realistic degree distributions into the models. As we will show, the results are in reasonably good agreement with empirical data, although there are some interesting discrepancies also, perhaps indicating the presence of social phenomena that are not incorporated in the random graph.

FIG. 7. The frequency distribution of numbers of co-stars of an actor in a bipartite graph with $\mu = 1.5$ and $\nu = 15$. The points are simulation results for $M = 10\,000$ and $N = 100\,000$. The line is the exact solution, Eqs. (89) and (90). The error bars on the numerical results are smaller than the points.



A. Collaboration networks 

In this section we construct random bipartite graph models of the known collaboration networks of company directors [?], movie actors, and scientists [?]. As we will see, the random graph works well as a model of these networks, giving good order-of-magnitude estimates of all quantities investigated, and in some cases giving results of startling accuracy.

Our first example is the collaboration network of the members of the boards of directors of the Fortune 1000 companies (the one thousand US companies with the highest revenues). The data come from the 1999 Fortune 1000 [?] and in fact include only 914 of the 1000, since data on the boards of the remaining 86 were not available. The data form a bipartite graph in which one type of vertex represents the boards of directors, and the other type the members of those boards, with edges connecting boards to their members. In Fig. 8 we show the frequency distribution of the numbers of boards on which each member sits, and the numbers of members of each board. As we see, the former distribution is close to exponential, with the majority of directors sitting on only one board, while the latter is strongly peaked around 10, indicating that most boards have about 10 members.

FIG. 8. Frequency distributions for the boards of directors of the Fortune 1000. Left panel: the numbers of boards on which each director sits. Right panel: the numbers of directors on each board.

Using these distributions, we can define generating functions $f_0(x)$ and $g_0(x)$ as in Eq. (23), and hence find the generating functions $G_0(x)$ and $G_1(x)$ for the distributions of numbers of co-workers of the directors. We have used these generating functions and Eqs. (72) and (81) to calculate the expected clustering coefficient $C$ and the average number of co-workers $z$ in the one-mode projection of board directors on a random bipartite graph with the same vertex degree distributions as the original dataset. In Table I we show the results of these calculations, along with the same quantities for the real Fortune 1000. As the table shows, the two are in remarkable (almost perfect) agreement.

It is not just the average value of $z$ that we can calculate from our generating functions, but the entire distribution: since the generating functions are finite polynomials in this case, we can simply perform the derivatives to get the probability distribution $r_z$. In Fig. 9, we show the results of this calculation for the Fortune 1000 graph. The points in the figure show the actual distribution of $z$ for the real-world data, while the solid line shows the theoretical results. Again the agreement is excellent. The dashed line in the figure shows the distribution for an ordinary Poisson random graph with the same mean. Clearly this is a significantly inferior fit.

                             clustering C         average degree z
network                      theory    actual     theory    actual
company directors            0.590     0.588      14.53     14.44
movie actors                 0.084     0.199      125.6     113.4
physics (arxiv.org)          0.192     0.452      16.74      9.27
biomedicine (MEDLINE)        0.042     0.088      18.02     16.93

TABLE I. Summary of results of the analysis of four collaboration networks.

FIG. 9. The probability distribution of numbers of co-directors in the Fortune 1000 graph. The points are the real-world data, the solid line is the bipartite graph model, and the dashed line is the Poisson distribution with the same mean. Insets: the equivalent distributions for the numbers of collaborators of movie actors and physicists.

In fact, within the business world, attention has fo- 
cussed not on the collaboration patterns of company di- 
rectors, but on the "interlocks" between boards, i.e., on 
the one-mode network in which vertices represent boards 
of directors and two boards are connected if they have one 
or more directors in common |28| , p9|] . This is also simple 
to study with our model. In Fig. |1C| we show the distri- 
bution of the numbers of interlocks that each board has, 
along with the theoretical prediction from our model. As 
we see, the agreement between empirical data and theory 
is significantly worse in this case than for the distribution 
of co-directors. In particular, it appears that our theory 
significantly underestimates the number of boards which 
are interlocked with very small or very large numbers of 
other boards, while over estimating those with interme- 
diate numbers of interlocks. One possible explanation of 
this is that "big-shots work with other big-shots." That 
is, the people who sit on many boards tend to sit on those 
boards with other people who sit on many boards. And 
conversely the people who sit on only one board (which 
is the majority of all directors), tend to do so with others 
who sit on only one board. This would tend to stretch 
the distribution of numbers of interlocks, just as seen 
in the figure, producing a disproportionately high number of
boards with very many or very few interlocks to others. 
To test this hypothesis, we have calculated, as a function 
of the number of boards on which a director sits, the av- 
erage number of boards on which each of their codirectors 

FIG. 10. The distribution of the number of other boards
with which each board of directors is "interlocked" in the 
Fortune 1000 data. An interlock between two boards means 
that they share one or more common members. The points are 
the empirical data, the solid line is the theoretical prediction. 
Inset: the number of boards on which one's codirectors sit, as 
a function of the number of boards one sits on oneself. 



sit. The results are shown in the inset of Fig. 10. If these 
two quantities were uncorrelated, the plot would be flat. 
Instead, however, it slopes clearly upwards, indicating in- 
deed that on the average the big-shots work with other 
big-shots. (This idea is not new. It has been discussed 
previously by a number of others — see Refs. and [Q , 
for example.) 
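
The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the calculation used for Fig. 10: the function name `codirector_board_counts`, the toy board data, and the choice to average over each director's distinct codirectors (rather than, say, weighting by shared boards) are all assumptions made here.

```python
from collections import defaultdict


def codirector_board_counts(boards):
    """Map n -> mean number of boards held by the codirectors of
    directors who themselves sit on n boards.

    `boards` is a dict: board name -> set of director names
    (a hypothetical input format chosen for this sketch).
    """
    # Number of boards on which each director sits.
    seats = defaultdict(int)
    for members in boards.values():
        for director in members:
            seats[director] += 1

    # Distinct codirectors of each director, over all shared boards.
    codirectors = defaultdict(set)
    for members in boards.values():
        for director in members:
            codirectors[director].update(members - {director})

    # Per-director mean of the codirectors' seat counts,
    # grouped by the director's own seat count.
    grouped = defaultdict(list)
    for director, others in codirectors.items():
        if others:
            mean_seats = sum(seats[o] for o in others) / len(others)
            grouped[seats[director]].append(mean_seats)

    return {n: sum(v) / len(v) for n, v in sorted(grouped.items())}


# Toy data, invented purely for illustration.
toy_boards = {
    "A": {"ann", "bob", "cat"},
    "B": {"ann", "bob", "dan"},
    "C": {"ann", "eve"},
}
result = codirector_board_counts(toy_boards)
```

If the two quantities were uncorrelated, the returned values would be roughly independent of the key; an upward trend with the key would correspond to the "big-shots work with other big-shots" pattern.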

The example of the boards of directors is a particularly
instructive one. What it illustrates is that the cases
in which our random graph models agree well with real-world
phenomena are not necessarily the most interesting.
Certainly it is satisfying, as in Fig. 9, to have the
theory agree well with the data. But probably Fig. 10
is more instructive: we have learned something about 
the structure of the network of the boards of directors by 
observing the way in which the pattern of board inter- 
locks differs from the predictions of the purely random 
network. Thus it is perhaps best to regard our random 
graph as a null model — a baseline from which our expec- 
tations about network structure should be measured. It 
is deviation from the random graph behavior, not agreement
with it, that allows us to draw conclusions about
real-world networks.

We now look at three other graphs for which our theory 
also works well, although again there are some noticeable 
deviations from the random graph predictions, indicating 
the presence of social or other phenomena at work in the 
network. 

We consider the graph of movie actors and the movies 
in which they appear |l^,^ and graphs of scientists 
and the papers they write in physics and biomedical
research [Q. In Table I we show results for the clustering
coefficients and average coordination numbers of the
one-mode projections of these graphs onto the actors or
scientists. As the table shows, our theory gives results for
these figures which are of the right general order of
magnitude, but typically deviate from the empirically
measured figures by a factor of two or so. In the insets
of Fig. 9 we show the distributions of numbers of
collaborators in the movie actor and physics graphs, and again
the match between theory and real data is good, but not 
as good as with the Fortune 1000. 

The figures for clustering and mean numbers of col- 
laborators are particularly revealing. The former is uni- 
formly about twice as high in real life as our model pre- 
dicts for the actor and scientist networks. This shows 
that there is a significant tendency to clustering in these 
networks, in addition to the trivial clustering one expects 
on account of the bipartite structure. This may indicate, 
for example, that scientists tend to introduce pairs of 
their collaborators to one another, thereby encouraging 
clusters of collaboration. The figures for average numbers 
of collaborators show less deviation from theory than the 
clustering coefficients, but nonetheless there is a clear 
tendency for the numbers of collaborators to be smaller 
in the real-world data than in the models. This probably 
indicates that scientists and actors collaborate repeat- 
edly with the same people, thereby reducing their total 
number of collaborators below the number that would 
naively be expected if we consider only the numbers of 
papers that they write or movies they appear in. It would 
certainly be possible to take effects such as these into 
account in a more sophisticated model of collaboration 
practices. 



B. The world-wide web 

In this section we consider the application of our theory
of random directed graphs to the modeling of the world-wide
web. As we pointed out in Section III A, it is not at
present possible to make a very accurate random-graph
model of the web, because to do so we need to know the
joint distribution $p_{jk}$ of in- and out-degrees of vertices,
which has not to our knowledge been measured. However,
we can make a simple model of the web by assuming
in- and out-degree to be independently distributed
according to their known distributions. Equivalently, we
assume that the joint probability distribution factors
according to $p_{jk} = p_j q_k$.

Broder et al. [26] give results showing that the in- and
out-degree distributions of the web are approximately
power-law in form with exponents $\tau_{\mathrm{in}} = 2.1$ and
$\tau_{\mathrm{out}} = 2.7$, although there is some deviation from the
perfect power law for small degree. In Fig. 11, we show
histograms of their data with bins chosen to be of uniform
width on the logarithmic scales used. (This avoids certain
systematic errors known to afflict linearly histogrammed

FIG. 11. The probability distribution of in-degree (left
panel) and out-degree (right panel) on the world-wide web,
rebinned from the data of Broder et al. The solid lines
are best fits of form (91).



data plotted on log scales.) We find both distributions 
to be well-fitted by the form 



$$p_k = C (k + k_0)^{-\tau}, \tag{91}$$



where the constant $C$ is fixed by the requirement of
normalization, taking the value $1/\zeta(\tau, k_0)$, where $\zeta(x, y)$ is the
generalized $\zeta$-function [Q. The constants $k_0$ and $\tau$ are
found by least-squares fits, giving values of 0.58 and 3.94
for $k_0$, and 2.17 and 2.69 for $\tau$, for the in- and out-degree
distributions respectively, in reasonable agreement with
the fits performed by Broder et al. With these choices,
the data and Eq. (91) match closely (see Fig. 11 again).
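
The logarithmic binning mentioned above (bins of uniform width on a log scale) might be sketched as follows. This is a minimal illustration, not the procedure actually applied to the data of Broder et al.: the function name `log_binned_histogram`, the `bins_per_decade` parameter, and the decision to drop vertices of degree zero (which cannot be placed on a log axis) are assumptions of this sketch.

```python
import math


def log_binned_histogram(degrees, bins_per_decade=5):
    """Histogram positive degrees in bins of equal width in log10(k).

    Returns a list of (lo, hi, density) triples, where density is the
    fraction of the (positive) data in [lo, hi) divided by hi - lo,
    so that it estimates a probability density in k.
    """
    positive = [k for k in degrees if k > 0]  # degree 0 is dropped
    n = len(positive)

    # Count how many data points fall in each logarithmic bin.
    counts = {}
    for k in positive:
        b = math.floor(math.log10(k) * bins_per_decade)
        counts[b] = counts.get(b, 0) + 1

    hist = []
    for b in sorted(counts):
        lo = 10 ** (b / bins_per_decade)
        hi = 10 ** ((b + 1) / bins_per_decade)
        # Normalize by the linear width of the bin, which grows with k.
        hist.append((lo, hi, counts[b] / (n * (hi - lo))))
    return hist


# Toy degree sequence, invented for illustration only.
hist = log_binned_histogram([1, 2, 3, 5, 20, 50], bins_per_decade=1)
```

Because the bins widen with increasing degree, each bin in the sparse tail collects many more vertices than a fixed-width bin would, which reduces the scatter there. For integer data, very narrow bins at small degree may contain no integers at all, so the number of bins per decade has to be chosen with that in mind.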

Neither the raw data nor our fits to them satisfy the 
constraint (^), that the total number of links leaving 
pages should equal the total number arriving at them. 
This is because the data set is not a complete picture of 
the web. Only about 200 million of the web's one billion 
or so pages were included in the study. Within this sub- 
set, our estimate of the distribution of out-degree is pre- 
sumably quite accurate, but many of the outgoing links 
will not connect to other pages within the subset studied. 
At the same time, no incoming links which originate out- 
side the subset of pages studied are included, because the 
data are derived from "crawls" in which web pages are 
found by following links from one to another. In such a 
crawl one only finds links by finding the pages that they 
originate from. Thus our data for the incoming links is 
quite incomplete, and we would expect the total number 
of incoming links in the dataset to fall short of the num- 
ber of outgoing ones. This indeed is what we see. The 
totals for incoming and outgoing links are approximately
$2.3 \times 10^8$ and $1.1 \times 10^9$.

The incompleteness of the data for incoming links limits
the information we can at present extract from a random
graph model of the web. There are however some
calculations which only depend on the out-degree
distribution.

Given Eq. (91), the generating functions for the
out-degree distribution take the form

$$G_0(x) = G_1(x) = \frac{\Phi(x, \tau, k_0)}{\zeta(\tau, k_0)}, \tag{92}$$



where $\Phi(x, y, z)$ is the Lerch $\Phi$-function [46]. The
corresponding generating functions $F_0$ and $F_1$ we cannot
calculate accurately because of the incompleteness of the
data. The equality $G_0 = G_1$ (and also $F_0 = F_1$) is a
general property of all directed graphs for which $p_{jk} = p_j q_k$
as above. It arises because in such graphs in- and
out-degree are uncorrelated, and therefore the distribution of
the out-degree of a vertex does not depend on whether 
you arrived at it by choosing a vertex at random, or by 
following a randomly chosen edge. 
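
The equality can also be checked in a line or two of algebra. The sketch below assumes the usual definitions for directed graphs, namely a joint generating function $\mathcal{G}(x,y)$ of in- and out-degree, with $G_0$ and $G_1$ obtained from it as shown and $z$ the mean degree; these definitions are restated here as assumptions rather than quoted from the text above.

```latex
% Assumed definitions; z = \sum_{j,k} j\,p_{jk} is the mean (in-)degree.
\mathcal{G}(x,y) = \sum_{j,k} p_{jk}\, x^j y^k, \qquad
G_0(y) = \mathcal{G}(1,y), \qquad
G_1(y) = \frac{1}{z}\,
         \frac{\partial \mathcal{G}}{\partial x}\bigg|_{x=1}.

% With p_{jk} = p_j q_k the double sum factorizes:
\mathcal{G}(x,y) = \Bigl(\sum_j p_j x^j\Bigr)\Bigl(\sum_k q_k y^k\Bigr)
\quad\Longrightarrow\quad
G_0(y) = \sum_k q_k y^k, \qquad
G_1(y) = \frac{\sum_j j\,p_j}{z}\,\sum_k q_k y^k = G_0(y),

% since z = \sum_j j\,p_j when the joint distribution factorizes.
```

The same argument with the roles of $x$ and $y$ exchanged gives $F_0 = F_1$.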

One property of the web which we can estimate from
the generating functions for out-degree alone is the
fraction $S_{\mathrm{in}}$ of the graph taken up by the giant strongly
connected component plus those sites from which the giant
strongly connected component can be reached. This is
given by

$$S_{\mathrm{in}} = 1 - G_0(1 - S_{\mathrm{in}}). \tag{93}$$



In other words, $1 - S_{\mathrm{in}}$ is a fixed point of $G_0(x)$. Using
the measured values of $k_0$ and $\tau$, we find by numerical
iteration that $S_{\mathrm{in}} = 0.527$, or about 53%. The
direct measurements of the web made by Broder et al.
show that in fact about 49% of the web falls in $S_{\mathrm{in}}$, in
reasonable agreement with our calculation. Possibly this
implies that the structure of the web is close to that of
a directed random graph with a power-law degree
distribution, though it is possible also that it is merely
coincidence. Other comparisons between random graph models
and the web will have to wait until we have more accurate
data on the joint distribution $p_{jk}$ of in- and out-degree.
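
The numerical iteration for Eq. (93) can be sketched as follows. This is a minimal illustration, not the code used for the calculation above: the function names `hurwitz_zeta`, `g0` and `solve_s_in` are hypothetical, the series truncation lengths and the Euler-Maclaurin tail correction are choices made for this sketch, and the out-degree distribution is assumed to have exactly the form of Eq. (91) with the fitted values 3.94 and 2.69.

```python
def hurwitz_zeta(s, a, n_terms=1000):
    """Approximate zeta(s, a) = sum_{k>=0} (k + a)**(-s), for s > 1."""
    head = sum((k + a) ** (-s) for k in range(n_terms))
    b = n_terms + a
    # Euler-Maclaurin estimate of the omitted tail: the integral of
    # (x + a)**(-s) from n_terms to infinity, plus half the first
    # omitted term.
    tail = b ** (1.0 - s) / (s - 1.0) + 0.5 * b ** (-s)
    return head + tail


def g0(x, tau, k0, norm, n_terms=2000):
    """Truncated series for G_0(x) = sum_k x**k (k + k0)**(-tau) / norm.

    Adequate for 0 <= x well below 1, where the terms decay
    geometrically; not intended for x close to 1.
    """
    total, x_pow = 0.0, 1.0
    for k in range(n_terms):
        total += x_pow * (k + k0) ** (-tau)
        x_pow *= x
    return total / norm


def solve_s_in(tau, k0, tol=1e-10, max_iter=10000):
    """Iterate u -> G_0(u) from u = 0 and return S_in = 1 - u."""
    norm = hurwitz_zeta(tau, k0)
    u = 0.0
    for _ in range(max_iter):
        u_next = g0(u, tau, k0, norm)
        converged = abs(u_next - u) < tol
        u = u_next
        if converged:
            break
    return 1.0 - u


# Out-degree fit values quoted in the text.
s_in = solve_s_in(tau=2.69, k0=3.94)
```

Starting from u = 0 the iterates increase monotonically towards the smallest fixed point of $G_0$, which is the relevant one here. With the out-degree fit values quoted in the text, the result should come out close to the figure of 0.527 given above.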



VI. CONCLUSIONS 

In this paper we have studied in detail the theory of 
random graphs with arbitrary distributions of vertex de- 
gree, including directed and bipartite graphs. We have 
shown how, using the mathematics of generating func- 
tions, one can calculate exactly many of the statistical 
properties of such graphs in the limit of large numbers 
of vertices. Among other things, we have given explicit 
formulas for the position of the phase transition at which 
a giant component forms, the size of the giant compo- 
nent, the average and distribution of the sizes of the 
other components, the average numbers of vertices a cer- 
tain distance from a given vertex, the clustering coeffi- 
cient, and the typical vertex-vertex distance on a graph. 
We have given examples of the application of our theory
to the modeling of collaboration graphs, which are
inherently bipartite, and the world-wide web, which is
directed. We have shown that the random graph theory 
gives good order-of-magnitude estimates of the properties 
of known collaboration graphs of business-people, scien- 
tists and movie actors, although there are measurable 
differences between theory and data which point to the 
presence of interesting sociological effects in these net- 
works. For the web we are limited in what calculations 
we can perform because of the lack of appropriate data to 
determine the generating functions. However, the calcu- 
lations we can perform agree well with empirical results, 
offering some hope that the theory will prove useful once 
more complete data become available. 